Author

Adrian Tran, Harshitha Bachina, Hayden King, Josh Shneyder

Linear Regression

Data Visualization

Logarithmic- Visualization

Code
lineargraph = ggplot(data = joined_data, aes(x = lGDP, y = Life_Expectancy)) +
  geom_point(alpha = 0.1, color = "orange") +
  labs(
    x = "Ln(GDP per capita)",
    y = "",
    subtitle = "Life Expectancy",
    title = "Relationship between Ln(GDP per capita) and Life Expectancy"
  ) +
  geom_smooth(method = "lm")
lineargraph

Code
logmodel = lm(Life_Expectancy ~ lGDP, data = joined_data)
tidy(logmodel)
# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    22.8     0.372       61.4       0
2 lGDP            5.30    0.0445     119.        0
Code
summary(logmodel)$r.squared
[1] 0.6078248

\[\widehat{Life Expectancy} = 22.84 + 5.312lGDP\] The logarithmic transformation of GDP per capita makes a linear model fit quite well. The \(R^2\) is moderate, which is pretty good considering how many factors can influence life expectancy across countries and across time. This model predicts that if GDP per capita increases by 1%, mean \(LifeExpectancy\) increases by approximately 0.05312 years. At a GDP per capita of 1 dollar per capita, mean \(LifeExpectancy\) is 22.84 years.

From here on out, we will examine only the relationship between \(lGDP\) and \(Life Expectancy\).

Changes Over Time

Code
years = 1959:2019
lms_year = matrix(nrow = length(years), ncol = 2)
lms_year[, 1] = years
coef_year = matrix(nrow = length(years), ncol = 3)
coef_year[, 1] = years
for (y in years) {
  year_data = joined_data |>
    filter(Year == y)
  year_model = lm(Life_Expectancy ~ lGDP, data = year_data)
  lms_year[y - 1958, 2] = summary(year_model)$r.squared
  coef_year[y - 1958, 2] = year_model$coefficients[1]
  coef_year[y - 1958, 3] = year_model$coefficients[2]
}
lms_year = data.frame(lms_year)
colnames(lms_year) = c("Year", "Rsquared")
coef_year = data.frame(coef_year)
colnames(coef_year) = c("Year", "Beta0", "Beta1")

ggplot(data = lms_year, aes(x = Year, y = Rsquared)) +
  geom_line(color = "steelblue") +
  ylim(0, 1) +
  labs(y = "",
       subtitle = "R-Squared",
       title = "R-Squared of ln(GDP per capita) and Life Expectancy over time")

While variation in \(lGDP\) has consistently explained about 60% of variation in \(LifeExpectancy\), the nature of this relationship has changed significantly over time.

Code
library(patchwork)
beta0 = ggplot(data = coef_year, aes(x = Year, y=Beta0)) +
  geom_line(color = "steelblue") +
  labs(y = "",
       subtitle = "Estimated Intercept",
       title = "Estimated Intercept Over Time")

beta1 = ggplot(data = coef_year, aes(x = Year, y=Beta1)) +
  geom_line(color = "steelblue") +
  labs(y = "",
       subtitle = "Estimated Slope",
       title = "Estimated Slope of ln(GDP per capita) Over Time")
beta0+beta1

Differences Between Countries

The relationship is pretty stable over time and has been trending upward for the past decade. Outliers can be seen in the noticeable dips in the \(R^2\). These low outliers could also be seen on the previous graphs.

Code
joined_data <- joined_data |>
  mutate(residuals = logmodel$residuals)
outliers <- joined_data |>
  filter(abs(residuals) > 20) |>
  mutate(lGDP = round(lGDP, digits = 2),
         residuals = round(residuals, digits = 2))
kable(
  outliers,
  col.names = c(
    "Year",
    "Country",
    "GDP per Capita",
    "Life Expectancy",
    "ln(GDP per Capita)",
    "Residual"
  )
) |>
  kable_material_dark("hover")
Year Country GDP per Capita Life Expectancy ln(GDP per Capita) Residual
1959 Cameroon 932 27.5 6.84 -31.59
1959 China 238 27.8 5.47 -24.05
1960 Cameroon 924 26.4 6.83 -32.64
1971 Burundi 331 17.1 5.80 -36.50
1993 Rwanda 216 9.5 5.38 -41.84
1998 Botswana 4570 46.4 8.43 -21.12
1999 Botswana 4570 45.5 8.43 -22.02
2000 Botswana 4490 45.0 8.41 -22.42
2001 Botswana 4680 44.9 8.45 -22.74
2002 Botswana 4820 45.7 8.48 -22.10
2002 Eswatini 2660 43.9 7.89 -20.75
2003 Botswana 4860 47.5 8.49 -20.34
2003 Eswatini 2750 43.1 7.92 -21.73
2004 Eswatini 2900 43.3 7.97 -21.81
2005 Eswatini 3060 44.1 8.03 -21.29
2006 Eswatini 3180 44.9 8.06 -20.70
2007 Eswatini 3180 45.5 8.06 -20.10
2009 Haiti 1300 32.5 7.17 -28.35

These high residual data points are all on the low end. They represent the following events: Cameroonian War of Independence and following Civil War

Great Leap Forward and resulting famine

Conflict between Hutus and Tutsis in Burundi and Rwanda (culminating in the 1993 Rwandan Genocide)

The AIDS crisis in Southern Africa as seen in Botswana and Eswatini

A series of hurricanes that hit Haiti in 2009

Code
countries = joined_data |>
  select(country) |>
  distinct()

lms_country = matrix(nrow = length(countries), ncol = 2)
lms_country[, 1] = countries
lms_country = data.frame(lms_country)
colnames(lms_country) = c("Country", "Rsquared")
idnum = 1
for (c in countries$country) {
  country_data = joined_data |>
    filter(country == c)
  country_model = lm(Life_Expectancy ~ lGDP, data = country_data)
  lms_country[idnum , "Rsquared"] = summary(country_model)$r.squared
  idnum = idnum + 1
}
Code
ggplot(data = lms_country, aes(x = Rsquared)) +
  geom_histogram(binwidth = 0.1, fill="steelblue") +
  labs(
    x = "R-squared",
    y = "",
    subtitle = "Count",
    title = "Histogram of Countries by R-Squared of ln(GDP per capita) and Life Expectancy"
  )

While the relationship between \(lGDP\) and \(Life Expectancy\) is quite strong for most countries, for some countries it is much weaker. This is mostly due to events that decrease life expectancy such as epidemics, famines, natural disasters and wars. Botswana for example had a linear relationship between \(lGDP\) and \(Life Expectancy\) until the AIDS crisis hit. This drop in \(Life Expectancy\) while \(lGDP\) continued to grow is responsible for its \(R^2\) of 0.0013.

Code
joined_data |>
  filter(country == "Botswana") |>
  ggplot(aes(x = lGDP, y = Life_Expectancy)) +
  geom_point() +
  labs(
    x = "ln(GDP per capita)",
    y = "",
    subtitle = "Life Expectancy",
    title = "Life Expectancy and ln(GDP per capita) in Botswana"
  ) +
  geom_smooth(method = "lm")

Code
lms_country |>
  mutate(Rsquared = round(Rsquared, digits = 3)) |>
  slice_min(order_by = Rsquared, n = 8) |>
  kable() |>
  kable_material_dark("hover")
Country Rsquared
Liberia 0.000
Botswana 0.001
Mauritania 0.002
Yemen 0.002
Gambia 0.003
Sierra Leone 0.004
Central African Republic 0.015
Solomon Islands 0.018
Code
joined_data |>
  filter(country == "Germany") |>
  ggplot(aes(x = lGDP, y = Life_Expectancy)) +
  geom_point() +
  labs(
    x = "ln(GDP per capita)",
    y = "",
    subtitle = "Life Expectancy",
    title = "Life Expectancy and ln(GDP per capita) in Germany"
  ) +
  geom_smooth(method = "lm")

In countries without life expectancy-decreasing events since 1959, there is a very strong linear relationship. In the case Germany, the \(R^2\) is 0.983.

Code
lms_country |>
  mutate(Rsquared = round(Rsquared, digits = 3)) |>
  slice_max(order_by = Rsquared, n = 8) |>
  kable() |>
  kable_material_dark("hover")
Country Rsquared
South Korea 0.983
Germany 0.983
Sao Tome and Principe 0.983
Lao 0.980
Colombia 0.978
Hong Kong, China 0.978
Djibouti 0.978
Egypt 0.974

Linear Regression

1. Linearity

Code
lineargraph

Even after the log transformation of \(GDP\), the data is still a bit nonlinear.

2. Random Sampling/ No Autocorrelation

The assumption of random sampling isn’t met, but we can substitute the assumption of no autocorrelation. The data shows partial autocorrelation of about 0.2 which is relatively stable across various lags, so this assumption isn’t met either. The p-value of the autocorrelation is 0.

Code
acf(joined_data$residuals, type = "correlation")

Code
library(lmtest)
dwtest(logmodel)

    Durbin-Watson test

data:  logmodel
DW = 1.6389, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0

3. No colinearity

Since there is only one explanatory variable (\(lGDP\)), there can’t be any colinearity.

4. Mean Independence

Code
ggplot(data = joined_data, aes(x = lGDP, y = residuals)) +
  geom_point(alpha = 0.15, color = "steelblue") +
  labs(
    x = "ln(GDP per capita)",
    y = "",
    subtitle = "Life Expectancy Residual",
    title = "Residuals Plot"
  )

The assumption of mean independence isn’t met because the explanatory variable can be used to predict the residuals. For high values of \(lGDP\), the residuals are all below 0 for example.

5. Homoskedasticity

Code
library(AER)
coeftest(logmodel, vcovHC(logmodel))

t test of coefficients:

            Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 22.84047    0.37353  61.147 < 2.2e-16 ***
lGDP         5.30161    0.04237 125.127 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
bptest(logmodel)

    studentized Breusch-Pagan test

data:  logmodel
BP = 301.4, df = 1, p-value < 2.2e-16

\[BP = n \times R^2_{\hat u^2} = 301.4 \] \[9162 \times R^2_{\hat u^2} = 301.4\] \[R^2_{\hat u^2} = 0.033\] While the data isn’t homoskedastic, the Breush-Pagan test statistic shows that only 3.3% of the variance in \(u_i\) can be explained by the variance in \(lGDP\). We can say this because \(var(u_i) = E(u^2_i) - E(u_i)^2\) and \(E(u_i)=0\).

6. Normality of \(u\)

Code
ggplot(data = joined_data, aes(x = residuals)) +
  geom_histogram(fill = "steelblue") +
  labs(
    y = "",
    subtitle = "Count",
    x = "Residuals",
    title = "Histogram of Residuals"
  )

While the distribution is skewed left, it isn’t severe. This is because of the low outliers mentioned earlier.

Estimated Model

As shown previously, the estimated linear regression model is: \[\widehat {Life Expectancy} = 22.84 + 5.312lGDP\] This model predicts that if GDP per capita increases by 1%, mean \(LifeExpectancy\) increases by approximately 0.05312 years. At a GDP per capita of 1 dollar per capita, mean \(LifeExpectancy\) is 22.84 years.

Model Fit

Code
variance_table = matrix(nrow = 1, ncol = 3)
colnames(variance_table) = c("Variance of Response",
                             "Response of Fitted Values",
                             "Variance of Residuals")
variance_table[1, 1] = var(joined_data$Life_Expectancy)
variance_table[1, 2] = var(logmodel$fitted.values)
variance_table[1, 3] = var(logmodel$residuals)
variance_table[1,] = round(variance_table[1,], digits = 2)
kable(variance_table) |>
  kable_material_dark()
Variance of Response Response of Fitted Values Variance of Residuals
97.63 59.34 38.29

\[R^2 = 1-\frac{38.29}{97.63} = \frac{59.34}{97.63} = 0.608\] The variance in \(lGDP\) accounts for 60.7% of the variance in \(Life Expectancy\). This is a moderate \(R^2\), so this model does a decent job of explaining the response. Considering how many factors may influence \(Life Expectancy\), this is a good quality model.

Code
set.seed(1738)
sigma = sigma(logmodel)
predictions = predict(logmodel)
simulation <- function(x, rse) {
  x <- x + rnorm(n = length(x), mean = 0, sd = rse)
  return(x)
}
actualhist = ggplot(data = joined_data, aes(x = Life_Expectancy)) +
  geom_histogram(binwidth = 2, fill = "steelblue")
simhist = ggplot(data = joined_data, aes(x = simulation(predictions, sigma))) +
  geom_histogram(binwidth = 2, fill = "steelblue")
actualhist + simhist

Because of the left skew of the data, the simulation distribution doesn’t exactly match the actual distribution.

Code
sims = 1:1000
sim_data = matrix(nrow = length(sims), ncol = 2)
sim_data[, 1] = sims
for(num in sims) {
  x = predictions + rnorm(n=length(joined_data$lGDP), mean = 0, sd = 6.188)
  simmodel = lm(joined_data$Life_Expectancy ~ x)
  sim_data[num, 2] = summary(simmodel)$r.squared
}
hist(sim_data[,2])

DO THIS WITH MAP

Code
library(gganimate)
library(gifski)
ggplot(data=joined_data, aes(x=lGDP, y=Life_Expectancy, color=country)) +
  geom_point(alpha = 0.7, show.legend = FALSE) +
  labs(title = 'Year: {frame_time}', x = 'ln(GDP per capita)', y = 'Life Expectancy') +
  transition_time(Year) +
  ease_aes('linear')